Bioinformatics A Practical Guide to Next Generation Sequencing Data Analysis (Hamid D. Ismail)

-U data/ENCFF000XGP_inp0.fastq.gz \
-S bam/ENCFF000XGP_inp0.sam \
2> bam/inp0.log

bowtie2 \
-p 4 \
-x ref/hg19 \
-U data/ENCFF000XJP_chp1.fastq.gz \
-S bam/ENCFF000XJP_chp1.sam \
2> bam/chp1.log

bowtie2 \
-p 4 \
-x ref/hg19 \
-U data/ENCFF000XJS_chp2.fastq.gz \
-S bam/ENCFF000XJS_chp2.sam \
2> bam/chp2.log

bowtie2 \
-p 4 \
-x ref/hg19 \
-U data/ENCFF000XKD_chp3.fastq.gz \
-S bam/ENCFF000XKD_chp3.sam \
2> bam/chp3.log
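The repeated alignment calls can also be generated from a list of sample names. The following is a minimal sketch, assuming the first sample (inp0) is aligned with the same options as the other three. The function name print_align_cmds is hypothetical and not part of bowtie2. It only prints the commands, so nothing is aligned until the output is passed to a shell.

```shell
# Print one bowtie2 command per sample (dry run); nothing is executed here.
# Sample names and paths are taken from the commands above.
print_align_cmds() {
    for s in ENCFF000XGP_inp0 ENCFF000XJP_chp1 ENCFF000XJS_chp2 ENCFF000XKD_chp3; do
        tag="${s##*_}"   # text after the last underscore, e.g. inp0
        echo "bowtie2 -p 4 -x ref/hg19 -U data/${s}.fastq.gz -S bam/${s}.sam 2> bam/${tag}.log"
    done
}

print_align_cmds
# To run the alignments for real, pipe the printed commands to a shell:
#   print_align_cmds | bash
```

Printing first makes it easy to check the file names before starting a long alignment run.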

The four SAM files produced by the above commands contain the alignment information
of the reads. However, they may also include alignments that we do not need; removing
them lets us focus on the regions of interest and reduces the computational load. We can
remove the mitochondrial read alignments, which are labeled “chrM” in the reference name
field of the SAM file, along with the alignments to unplaced, unlocalized, and alternate
haplotype sequences, whose names contain “chrUn”, “random”, and “hap”, respectively,
keeping only the reads aligned to the main human chromosomes. We can do this with the
Linux “sed” command, saving the filtered alignments in new files.

cd bam
sed '/chrM/d;/random/d;/chrUn/d;/hap/d' ENCFF000XGP_inp0.sam > ENCFF000XGP_inp0_filt.sam
sed '/chrM/d;/random/d;/chrUn/d;/hap/d' ENCFF000XJP_chp1.sam > ENCFF000XJP_chp1_filt.sam
sed '/chrM/d;/random/d;/chrUn/d;/hap/d' ENCFF000XJS_chp2.sam > ENCFF000XJS_chp2_filt.sam
sed '/chrM/d;/random/d;/chrUn/d;/hap/d' ENCFF000XKD_chp3.sam > ENCFF000XKD_chp3_filt.sam
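Note that sed deletes every line in which one of the patterns appears anywhere, not only in the reference name field. The sketch below illustrates this on a small made-up SAM file and compares it with a field-based filter written in awk. The awk variant is an alternative offered here as an assumption, not the method used in this chapter, and the function names and toy reads are hypothetical.

```shell
# Toy SAM file: fields are tab-separated; column 3 (RNAME) is the reference name.
make_toy_sam() {
    printf '@SQ\tSN:chr1\tLN:1000\n'
    printf '@SQ\tSN:chrM\tLN:500\n'
    printf 'r1\t0\tchr1\t10\t42\t4M\t*\t0\t0\tACGT\tIIII\n'
    printf 'r2\t0\tchrM\t20\t42\t4M\t*\t0\t0\tACGT\tIIII\n'
    printf 'r3\t0\tchr6_apd_hap1\t30\t42\t4M\t*\t0\t0\tACGT\tIIII\n'
    printf 'r4\t0\tchrUn_gl000220\t40\t42\t4M\t*\t0\t0\tACGT\tIIII\n'
    printf 'r5\t0\tchr1_gl000191_random\t50\t42\t4M\t*\t0\t0\tACGT\tIIII\n'
    # Read aligned to chr1 whose NAME happens to contain "hap"
    printf 'hap_r6\t0\tchr1\t60\t42\t4M\t*\t0\t0\tACGT\tIIII\n'
}

# Pattern-based filter, as used above: matches anywhere in the line.
filter_sed() {
    sed '/chrM/d;/random/d;/chrUn/d;/hap/d'
}

# Field-based filter (hypothetical alternative): tests only the SN tag of
# @SQ header lines and the RNAME column of alignment lines.
filter_awk() {
    awk -F'\t' '
        /^@SQ/ { if ($2 !~ /chrM|random|chrUn|hap/) print; next }
        /^@/   { print; next }
        $3 !~ /chrM|random|chrUn|hap/
    '
}

echo "sed filter:"
make_toy_sam | filter_sed
echo "awk filter:"
make_toy_sam | filter_awk
```

Both filters drop the chrM header line and the reads r2 to r5. They differ only for hap_r6: sed removes it because of its name, while awk keeps it because it is aligned to chr1.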

We can then convert the filtered SAM files into BAM files using the “samtools view” command.

samtools view -S -b ENCFF000XGP_inp0_filt.sam > ENCFF000XGP_inp0_filt.bam